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DIMENSION REDUCTION BASED ON CONSTRAINED 
CANONICAL CORRELATION AND VARIABLE FILTERING^ 

By Jianhui Zhou and Xuming He 
University of Virginia and University of Illinois 



The "curse of dimensionality" has remained a challenge for high- 
dimensional data analysis in statistics. The sliced inverse regression 
(SIR) and canonical correlation (CANCOR) methods aim to reduce 
the dimensionality of data by replacing the explanatory variables with 
a small number of composite directions without losing much informa- 
tion. However, the estimated composite directions generally involve 
all of the variables, making their interpretation difficult. To simplify 
the direction estimates, Ni, Cook and Tsai [Biometrika 92 (2005) 242- 
247] proposed the shrinkage sliced inverse regression (SSIR) based on 
SIR. In this paper, we propose the constrained canonical correlation 
(C^) method based on CANCOR, followed by a simple variable filter- 
ing method. As a result, each composite direction consists of a subset 
of the variables for interpretability as well as predictive power. The 
proposed method aims to identify simple structures without sacrific- 
ing the desirable properties of the unconstrained CANCOR estimates. 
The simulation studies demonstrate the performance advantage of the 
proposed method over the SSIR method. We also use the proposed 
method in two examples for illustration. 



1. Introduction. For data sets with a large number of variables, the well- 
known "curse of dimensionality" poses a challenge to most statistical meth- 
ods. Dimension reduction methods are often used to reduce dimensionality, 
enabling regression or classification to be performed in a parsimonious way. 

We consider a regression setting where the univariate response y is related 
to the explanatory variable Xpxi through K linear combinations of x and 
an unknown function /. One way to describe the model is 
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where e is the random error independent of x, K is the smallest integer for 
model (1) to hold, and Pi are a set of effective dimension reduction (e.d.r.) 
directions as in [11]. Following [2], the subspace Sy^^ = span{/3i, . . . , /S^^} C 
is called the central dimension reduction subspace (DRS). The central 
DRS is unique, even though the sets of e.d.r. directions, which span the 
central DRS, can be taken differently when K >2. Given the central DRS 
Sy\xj we use the /C-dimensional projection of x onto 5^1^^^, instead of x itself, 
in the model. 

To estimate the dimensionality K and the central DRS, the sliced inverse 
regression (SIR) method is proposed by [11] to summarize the explanatory 
variables into a smaller number of linear projections. Compared to the prin- 
cipal component analysis that summarizes the information contained in the 
explanatory variables, SIR takes into account extra information contained 
in the response variable. The sliced average variance estimation (SAVE) 
method is proposed by [5], and it is shown that it captures a larger portion 
of the central DRS than SIR by [4]. To achieve exhaustive estimation of 
the central DRS, simple contour regression (SCR) and general contour re- 
gression (GCR) are proposed by [9]. Assuming an additive random error in 
model (1), the (conditional) minimum average variance estimation (MAVE) 
method is proposed in [20] by minimizing an objective function that involves 
both the direction estimation and the nonparametric function estimation. 
Other dimension reduction methods in the literature include the ordinary 
least squares (OLS) by [14] , the principal Hessian directions (PHD) by [12], 
and the canonical correlation (CANCOR) method by [8]. 

When K < p, a. reduction in dimensionality is achieved through model 
(1), but each effective direction usually involves all of the explanatory vari- 
ables. When p is large and the variables in x are of different scales, the 
estimated linear combinations x'^Pi are difficult to interpret, which makes 
them less useful in the analyses following dimension reduction. Attempts 
have been made to address this problem. For single-index models, which are 
a special case of model (1), an AlC-based criterion is proposed by [16] to 
select the relevant variables and the smoothing parameter for estimating the 
unknown function simultaneously. In the general framework of model (1), 
the shrinkage sliced inverse regression (SSIR) method is developed in [17] by 
employing the LASSO approach of [19]. In SSIR, the central DRS Sy^^ is es- 
timated by span{diag(d)i?}, where span{M} denotes the subspace spanned 
by the columns of the matrix M, 13 corresponds to the estimated central 
DRS spanji?} of SIR in [3], and the shrinkage indices a are determined 
through a LASSO regression subject to an Li-norm constraint on a. 

In this paper, we propose an approach to reduce the number of variables 
appearing in the CANCOR directions, aiming to explore possible sparsity 
in the e.d.r. directions through both dimension reduction and variable se- 
lection. We describe the constrained canonical correlation (C'^) method in 
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Section 2 by imposing the Li-norm constraint on the e.d.r. direction esti- 
mates in CANCOR. In Section 3, a variable filtering procedure is proposed 
to threshold the estimated coefficients to reduce the number of nonzero co- 
efficients in the direction estimates by C^. The final values of the nonzero 
coefficients are then re-estimated by CANCOR on the relevant subset of 
variables. Simulation studies are conducted in Section 4, and two data sets, 
the car price data and the Boston housing data, are analyzed in Section 5. 
Concluding remarks about the proposed method can be found in Section 6. 

2. Constrained canonical correlation. 

2.1. CANCOR revisited. Using the B-spline basis functions generated 
for the response variable, the CANCOR method, which is asymptotically 
equivalent to SIR, is proposed by [8]. Suppose that the range of y is a 
bounded interval [a, h] . Given A;„, internal knots in [a, b] and the spline order 
m, we generate m + kn B-spline basis functions. Under the linearity condi- 
tion of [11], CANCOR estimates a set of e.d.r. directions by estimating the 
canonical variates between the B-spline basis functions and x. Since the gen- 
erated m + kn B-spline basis functions sum to 1 , we use in CANCOR the first 
m + kn — l basis functions of y, 7r{y) = {iTi{y), . . . , TTm+k„-i{y))'^ ■ Given n ob- 
servations, {Yi,Xi), of the random variables (y, x), let Xnxp = {Xi , . . . , 
and n„x(m+fc„-i) = ('''"(^i)i • • • i7r(l^))^ be the two data matrices containing 
the predictor values and the B-spline basis function values. The CANCOR 
method is then to estimate the canonical correlations between the columns of 
X and the columns of 11. The dimensionality of the central DRS is selected by 
performing the following sequential tests on the number of nonzero canonical 
correlations, -ffo,s : 7s > 7s+i = versus Hi s '■ 7s+i > for s = 0, 1, . . . ,p — 1, 
where 7^ are the asymptotic canonical correlations between 7r(y) and x in 
decreasing order. For details on the test statistic, see [8]. The dimensionality 
estimate K is the smallest s such that -ffo,s is not rejected. The directions 
Pi in the estimated canonical variates x'^Pi, corresponding to the nonzero 
correlations, are the estimated e.d.r. directions. The estimated central DRS 
is S'y|a, = span{ /?!,..., 

— 1/2 

Using the standardized variables z = T,xx [x — E{x)], where E{x) and T,xx 
are mean and covariance of x, it is shown in [8] that CANCOR is based on 
Cov[E{z\y)], which is the same as the kernel matrix in SIR. Letting {Xi,T]i), 
i = 1,2, . . . ,p, be the eigenvalues in decreasing order and the corresponding 
eigenvectors of the matrix Cov[E(z\y)], the canonical correlations between 
Tri{y) and x are 7^ = X]^"^ , and the canonical directions Pi = S^i^^ryj, corre- 
sponding to the nonzero eigenvalues Aj, are contained in the central DRS 
Sy^xi assuming the linearity condition of [11]. 
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The CANCOR method actuaUy solves an optimization problem that se- 
quentially finds the directions /3i with the maximum correlations between 
x'^Pi and some functions of y. If the canonical correlations are strictly mono- 
tone, then the CANCOR approach leads to the unique identification of the 
directions f3i, not just the space spanned by those directions. The directions 
Pi have their own interpretations, regardless of whether model (1) holds. 

2.2. Constrained CANCOR. To estimate the first constrained e.d.r. di- 
rection, we solve the following nonlinear constrained canonical correlation 
(C^) problem, where will be used throughout the paper for the Li- 

norm of any vector /3. 

Problem 1. 

Maximize a^Tij^xP^ 
subject to a^t^T,a = l; P'^txxP = l; ||/3||Li<i- 

In Problem 1, the matrix Syrx is taken to be the sample covariance matrix 
between 7r{y) and x, and Stttt and S^x the sample variance-covariance ma- 
trices of TT{y) and x, respectively. The maximizer of Problem 1 for a given t 
is denoted as and the maximal correlation is 7f = cif-^STrx/^i- The 

criterion of selecting the value of t will be discussed later in this section. 

One salient point about Problem 1 is that it does not use the inverse 
of the sample covariance matrix 'Sxx- Even if the number of variables p is 
greater than n, the constrained problem will generally have a unique solu- 
tion for small t, while the unconstrained problem admits possibly infinitely 
many solutions. When there are multiple candidates to achieve the same cor- 
relation, the constrained optimization favors the candidate directions with 
sparse coefficients. 

For the identifiability of the ith (i > 2) constrained e.d.r. directions, we 
require them to be uncorrelated with the previously estimated directions, 
that is, the ith (i > 2) constrained e.d.r. direction is estimated by solving the 
following constrained optimization problem for (q;^,/3?), and 7? = af^'ETrxPi- 

Problem 2. 

Maximize a^Ti-j^xP, 
subject to a^S„^a = l; P'^txxP = l; \\P\\Li<t; 
a^t^^af = 0; P'^±xxPf = 0; l = l,...,i-l. 

Starting from Problem 1 and solving Problem 2 iteratively over i, we get a 
set of the constrained e.d.r. direction estimates {Pf} for i = 1, . . . , K , where 
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Fig. 1. Constraint effect in 2D space. 

K is the number of significant e.d.r. directions determined by the sequential 
tests in CANCOR or any pre-determined dimensionahty for the problem. 
The tuning constant t is expected to vary with i, as detailed in the next 
subsection. 

Note that without the Li-norm constraint < i, Problems 1 and 2 

indeed estimate the canonical correlations between 7r(y) and x. Under the 
constraint, some coefficients in the direction estimates /3f shrink toward 0. 
To illustrate that the Li-norm constraint favors sparse coefficients of /3f , we 
discuss the effect of the constraint in Problem 1 in the 2-dimensional space 
as shown in Figure 1. 

In Figure 1, we assume S^x = I2 and /? = {/3^,/3'^). The circle is the contour 
defined by = iP^)"^ + (Z^^)^ = !> and the squares are defined by + 

|/?2| = t for t = 1.5, 1.2 and 1.0, respectively. In CANCOR, we search for the 
estimate of /? on the whole circle to maximize the correlation. In C^, the 
feasible region of (3 is the part of the circle lying in the square defined by 
t. When t gets smaller, the coefficients of the points in the feasible region 
are closer to being sparse. As shown in Figure 1, when t is decreased to 1, 
the feasible region shrinks to the four points (±1,0) and (0, ±1). Similarly 
in the general p-dimensional space, when t gets smaller, the points in the 
feasible region defined by t move toward sparser coordinates. 

For selecting the tuning parameter t in Problem 1 or 2, we first study 
the range of t in which the Li-norm constraint takes effect. When each 
explanatory variable in x G is standardized to vfiricince 1, the nicttrix 
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has diagonal elements 1 and off-diagonal elements between —1 and 1. Thus, 
we have > P'^'SxxP = 1- When t = 1, only the unit vectors, such as 

(0, . . . , 0, ±1, 0, . . . , 0), are in the feasible region. When t < 1, the feasible 
region is empty and the optimization problem does not have a solution. 
Therefore, the tuning parameter t should not be less than 1 for a meaningful 
Li-norm constraint. On the other hand, when t > to, where to = ||/3j||Li and 
f3i is the unconstrained e.d.r. direction estimate by CANCOR, the maximizer 
is always Pi. Therefore, the tuning parameter t should be in the interval 

2.3. Computational issues. Starting from t = to, we search for the value 
of t in each problem iteratively by decreasing t by a small amount At in each 
iteration. Given the value of t in each iteration, the constrained direction f3f 
and the maximum correlation ■yf are estimated. When t gets smaller, more 
coefficients in (3f tend to shrink toward 0. As t is decreased to 1, only one 
variable will be involved in the estimated I3f. However, we usually need 
to stop decreasing t before it reaches 1, because the maximum correlation 
may have decreased too much when t is too small. To balance the simplic- 
ity of the estimated direction with the loss in the correlation, we propose 
to stop decreasing t when the maximum correlation drops below a lower 
confidence limit of the canonical correlation. The limiting distributions of 
the sample canonical correlations can be found in [15]. However, those dis- 
tributions do not apply directly in our setting. The positive news is that 
the accuracy (in confidence level) of such a confidence limit is not so crit- 
ical for our purpose, and we prefer to use easy approximations based on 
the following Fisher's transformation. Given the estimated canonical corre- 
lation 7i, we transfer it to pi = (l/2)log[(l + 7i)/(l — 7^)], whose distribution 
could be approximated by the normal distribution with standard deviation 
l/^/^T^. Thus, we use Pi — Zi_q/-v/?i — 3 as the 100(1 — a)% lower con- 
fidence limit of the transformed canonical correlation, where Zi^a is the 
100(1 — a)% quantile of the standard normal distribution. Accordingly, the 
lower confidence limit of the ith canonical correlation could be approxi- 
mated by (exp(2Tj) — l)/(exp(2rj) + 1), where Ti = pi — Zx^al^/n — 3. We 
stop decreasing t in the iterative process when the corresponding maximum 
correlation falls below the lower confidence limit given above. The value of 
t in the iteration just before the above process is stopped is selected for 
Problem 1 or 2, and the corresponding direction estimate /3f is taken as the 
estimated ith constrained e.d.r. direction. The two parameters. At and a, 
are user-specified. In this paper, we use as default a = 0.005 and At = 0.05. 

Problems 1 and 2 are nonlinear constrained optimization problems with 
an Li-norm constraint. Most of the existing algorithms for solving the con- 
strained optimization problems need a pair of initial directions for a and /?, 
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and there is no guarantee that the global maximum will be found through 
iteration. In the method, we use the unconstrained e.d.r. directions Oj 
and Pi estimated by CANCOR as the initial directions for solving Problems 
1 and 2. In the iterative process for selecting the value of t, the constrained 
direction estimates in the previous iteration are used as the initial directions 
in the next iteration when t is deceased by At. Since the unconstrained e.d.r. 
directions correspond to the global maxima, the proposed scheme greatly re- 
duces the chance of hitting a local maximum in the iterations. Furthermore, 
we always know what to expect from the maximum correlation, so it is easy 
to know when a poor local maximum is found. 

2.4. Consistency. Throughout we assume that T^xx > 0, and that the 
nonzero eigenvalues of Cov[E{z\y)] are distinct, that is, 71 > 72 > • • • > 7/< > 
0. In this setting, the canonical directions Pi {i < K) are uniquely identified, 
in contrast to those in model (1). Obviously, the directions Pi defined through 
CANCOR are not necessarily part of those in model (1), unless a well-known 
linearity condition holds on the distribution of x (see [11]). We believe that 
the composite directions Pi are useful even when model (1) does not hold. 
Letting /c„ be the number of internal knots used in the spline, we have: 

Theorem 1. Assume that Ti^x > 0, 71 > • • • > 7/^ > for some K, and 
that the following conditions Ai -A4 hold. 

Ai : The marginal density of y is bounded away from and infinity on 
[a,b]. 

A'}-, 00 and kn = o(n^/^) as 00. 

aI- E{\\x\\^)<oo. 

Ai^: Each component of E{z\y) is a function on [a, with hounded deriva- 
tive; 

Then we have PI — > Pi for i < K. 

Note that conditions A\-Ai^ are sufficient for the consistency of the con- 
strained canonical correlation estimates, but the assumption of distinct eigen- 
values 7j, i = 1, . . . , i^T is there to ensure the consistency of the constrained 
direction estimates. Since the direction estimates from will not be taken 
as the final estimates, we shall defer the results on asymptotic distributions 
to the next section. 

3. Variable filtering and final estimates. As demonstrated in Figure 1, 
the Li-norm constraint helps to shrink the e.d.r. direction estimates toward 
the sparse regions in M^, but exact zero coefficients in PI are not aimed 
at. To remove the variables with no or very marginal effects on the e.d.r. 
directions, we propose to threshold the coefficients in /3f through a variable 
filtering procedure. 



8 



J. ZHOU AND X. HE 



The proposed variable filtering procedure is simple and iterative. At each 
iteration, one more coefficient of Pf is set to be according to the magni- 
tudes of the coefficients in absolute value. Recall that the x variables are 
standardized individually. Given the constrained e.d.r. direction estimates 
a? and Pf from Problem 1 or Problem 2, the proposed variable filtering 
procedure involves the following steps: 

1. Let d = p. 

2. Define a new direction P'f{d) by keeping the largest d coefficients of (3f 
in absolute value and setting the other (p — d) coefficients to be 0. Find 
/5f (d) as the projection of I3'f{d) into the space the set of all (5 such 
that: (i) = 0, j < i - 1, (ii) the set of zero coefficients in /3 is the 
same as that in P'f{d) and (iii) = 1- 

3. Compute the correlation, = Corr(7r(y)'^d|, rE-^/3f (d)), and the BIC-type 
criterion, BIC{d) = ?ilog(l — r^) + d\og{n). 

4. Let d = d — l. Repeat steps 2-4 until d = 0. 

Performing the above variable filtering procedure, we get a sequence of 
BIC[d) as d decreases from p to 0. Let do be the integer at which BIC{d) is 
minimized. Then, the p — do smallest coefficients of /3f in absolute value are 
set to 0. This proposed variable filtering procedure is a simplified variable se- 
lection procedure. In variable filtering, at most p possibilities are considered, 
which makes it feasible to do even when p is large. 

Whenever d < i in Step 2, the projection /9f (d) is usually a zero vector, and 
in those cases we set = 0. Therefore, the number of nonzero coefficients 
for Pi will be no fewer than i by our variable filtering algorithm. This is not 
an undesirable feature, as it might be unwise to be aggressively dropping 
variables if a large number of dimensions will be needed in the model. In 
most applications, we may be looking for the cases of K <^p; the restriction 
of d > z on the results for /?j is mild. 

It is important to note that the proposed variable filtering procedure 
should be applied to the constrained direction estimates by the method. 
If it is applied directly to the unconstrained direction estimates, it will not be 
effective, as shown in the first example in Section 5, mainly because the sizes 
of the coefficients are often "distorted" by collinearity of the variables in the 
unconstrained direction estimates. Variable selection in the unconstrained 
problem would be a harder problem. 

To aim for higher correlation, we propose re-estimation of the directions 
after variable filtering: the variables selected for any of the first K con- 
strained directions are combined, and the CANCOR method is performed 
on those variables alone. Then, the final estimate of Pi, now denoted as p{ , 
is obtained by taking the estimates from re-estimation for its nonzero co- 
efficients and by keeping the zero coefficients as determined in the variable 
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filtering stage. The following theorem concerns the asymptotic behavior of 
the final direction estimates. 

Theorem 2. Assume that Ti^x > and 71 > • • • > > for some K. 

(i) Under conditions A1-A4 in Theorem 1, the proposed filtering pro- 
cedure will select all of the variables with nonzero coefficients in (3i (i = 
1,2, ... ,K) with probability tending to 1 as n ^ 00. 

(ii) Suppose Pi = /3i,2; ■ • where is a vector that contains the 
nonzero coefficients of Pi and Pij = (j >2) are the remaining individual 
components. If condition A2 is strengthened to the following A2, ^n/"- ~^ 
and k'^/n 00; then for i = 1, . . . , K , \/n{p{-^^ — Pi^i) has the same limiting 
distribution as y/n[Pi^i — Pi^i), and \/n\p( ^ ~ stochastically dominated 
by y/n\Pij - Pij\ for any j > 2. 

We refer to [8] for the limiting distribution of y/n[Pi — Pi) under the un- 
constrained CANCOR method. In parametric settings, penalized likelihoods 
with Li constraints have been studied by earlier authors such as [6] and [7]. 
Here, a measure of correlation takes the role of a likelihood, and the non- 
parametric component (in the case of fc„ increasing with n) complicates 
the analysis. Indeed, the proposed dimension reduction is a combination of 
constrained canonical correlation {C^), variable filtering, and re-estimation. 
The step in moves the estimated direction closer to Pi under sparsity, 
which enables us to perform model selection through a simple variable filter- 
ing scheme. Re-estimation following variable filtering updates the direction 
estimates to achieve the highest possible correlation with a function of y. 
Theorem 2 shows that our proposed direction estimates are designed to ex- 
plore the sparsity in Pi with the same or better asymptotic efficiency than 
the CANCOR estimates. 

The BIC-type criterion used in variable filtering is chosen to mimic the 
Bayesian information criterion, but it cannot be identified as the exact or 
approximate BIG in the usual sense. Rather, this choice is in part based on 
our empirical experience. The asymptotic results in Theorem 2 remain valid 
for a wide range of model selection criteria that balance correlation with 
complexity d. 

4. Simulation studies. We perform 4 simulation studies in this section to 
study the performance of the proposed dimension reduction method, which 
consists of C^, variable filtering and re-estimation. For convenience, in the 
simulation studies and in the examples in Sections 4 and 5, we will simply 
refer to the proposed dimension reduction as the method. To compare 
the method with the shrinkage sliced inverse regression (SSIR) method 
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Table 1 







Summary 


of Study 1 








Method 
Criterion 










SSIR 




a = 0.01 


a = 0.005 


GCV 


AIC 


BIC 


RIC 


Sample size 






71 = 60 








A3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


(SE) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


A21 


20.73 


20.80 


20.00 


19.94 


20.95 


20.99 


(SE) 


(0.06) 


(0.05) 


(0.11) 


(0.11) 


(0.02) 


(0.01) 


Sample size 






n= 120 








A3 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


(SE) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


A21 


21.00 


21.00 


20.35 


20.34 


20.99 


21.00 


(SE) 


(0.00) 


(0.00) 


(0.08) 


(0.08) 


(0.01) 


(0.00) 



proposed in [17], we specify the settings of the first 3 studies to be the same 
as those in their paper. 

As reviewed in Section 1, SSIR estimates a shrinkage index C = (Ci) ■ • ■ i Cp) 
subject to IICIIli < i- The central DRS Sy\x is estimated by span{diag(C)-B} 
in SSIR, where C, is the estimate of C by SSIR and B is the matrix with 
columns as the e.d.r. direction estimates by SIR. The estimated shrinkage 
indices Q compress some rows of B to 0, and the corresponding variables 
will not be involved in the dimension reduction results. There are four cri- 
teria used in SSIR for selecting the value of their tuning parameter t: the 
generalized cross validation (GCV) criterion, Akaike's information criterion 
(AIC), the Bayesian information criterion (BIC) and the residual informa- 
tion criterion (RIC). We refer to [17] for details on SSIR and those criteria. 

In implementing C^, we varied a in the set {0.025,0.01,0.005,0.0025} 
in the selection of the tuning parameter t, but found that the results were 
quite robust. For brevity, the results with a = 0.01 and 0.005 are reported 
here. To solve the optimization problems, we used an R interface with the 
FORTRAN subroutine VF13AD in the Harwell Subroutine Library. Other 
packages such as PROC NLP in SAS may be used for the same purpose. 

In each study, we generate 100 data sets of the sample size 60 or 120. To 
generate the B-spline basis functions '7r(y), the quadratic spline (order m = 3) 
with A:„ = 4 internal knots is used. Accordingly, we use / = 7 slices in SSIR 
such that m + kn = I- Since the explanatory variables are generated from 
normal distributions, the linearity condition holds in each study. The average 
numbers of coefficients in the estimated constrained e.d.r. directions are 
summarized. 
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Study 1. 

y = xi + X2 + X3 + 0.5e, 

where x = (xi, . . . ,X2a)^ ~ -/V(0, /24)> e ~ -^(0, 1), and x and e are indepen- 
dent. In this study, the true direction is /3i = (1, 1, 1,0, . . . ,0)"^ with 21 zero 
coefficients. Table 1 summarizes the average number of zero coefficients, as 
well as the corresponding standard errors, in diag(Q;)/3i estimated by SSIR 
and in f^l estimated by C^, where: 

• ^3 is the average number of zero coefficients out of the first 3 coefficients 
in diag(d)/3i for SSIR or in for C^; 

• A21 is the average number of zero coefficients out of the last 21 coefficients 
in diag(a!)/3i or in 

Note that for the true direction /3i, we have ^3 = and A21 = 21. 
Study 2. 

?/ = xi/{0.5 + (2;2 + 1.5)2} + 0.2e, 

where a; = (xi, . . . , ~ A^(0, 124), e~-^(0,l), and x and e are inde- 

pendent. In this study, the true central DRS is the subspace spanned by 
(5i = (1,0,..., 0)^ and 02 = (0, 1, 0, . . . , 0)^. Letting S = diag(Q)(/3i, /32) es- 
timated by SSIR and = {M^M) estimated by C^, Table 2 summarizes 
the averages A2 and A22, where: 

• A2 is the average number of zero rows out of the first 2 rows in B for 
SSIR or in for C^; 

• j422 is the average number of zero rows out of the last 22 rows in B or in 
Bf. 

Here, a zero row is a row vector with all elements equal to 0. Note that for 
the matrix B = (/3i,/32) corresponding to the true central DRS, ^2 = and 
A22 = 22. 

Study 3. 

y = xi/{0.5 + (x2 + 1.5)2} + 0.2e, 

where e ~ iV(0,l), x G N{0,Y.) with Sj^ = O.Sl*"-''! for 1 < i,j < 24, and 
e and x are independent. The averages A2 and A22 defined in Study 2 are 
summarized in Table 3 with A2 = and A22 = 22 corresponding to the true 
central DRS. 
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Table 2 







Summary 


of Study 2 








Method 
Criterion 






SSIR 




a = 0.01 


a = 0.005 


GCV 


AIC 


BIC 


RIC 


Sample size 






71 = 60 








A2 


0.14 


0.14 


0.02 


0.02 


0.07 


0.17 


(SE) 


(0.03) 


(0.03) 


(0.01) 


(0.01) 


(0.03) 


(0.04) 


A22 


18.48 


18.93 


6.70 


6.27 


14.23 


18.62 


(SE) 


(0.23) 


(0.17) 


(0.25) 


(0.25) 


(0.26) 


(0.16) 


Sample size 






n = 120 








A2 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


(SE) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


(0.00) 


A22 


19.44 


19.45 


7.66 


7.29 


15.48 


19.80 


(SE) 


(0.14) 


(0.15) 


(0.27) 


(0.27) 


(0.23) 


(0.14) 


Study 4. 




y = xi-\ 


+ X24 + 0.5e, 









where x = {xi, . . . ,X24)"^ ~ N {0,124), e ~ A^(0, 1), and x and e are indepen- 
dent. The true direction is /3i = (1, 1, . . . , 1)^. Table 4 summarizes A that is 
the average number of zero coefficients in diag{a)(3i for SSIR or in p( for 
C^. Note that for the true direction Pi, we have A = 0. 



In Studies 1-3, the residual information criterion (RIC) has the best over- 
all performance among the four criteria of SSIR in identifying the relevant 

Table 3 







Summary 


of Study 3 








Method 
Criterion 










SSIR 




a = 0.01 


a = 0.005 


GCV 


AIC 


BIC 


RIC 


Sample size 






71 = 60 








A2 


0.27 


0.33 


0.07 


0.07 


0.18 


0.22 


(SE) 


(0.04) 


(0.05) 


(0.03) 


(0.03) 


(0.04) 


(0.04) 


A22 


18.94 


19.37 


5.79 


5.28 


14.35 


18.52 


(SE) 


(0.18) 


(0.18) 


(0.28) 


(0.28) 


(0.26) 


(0.17) 


Sample size 






n = 120 








A2 


0.18 


0.25 


0.01 


0.01 


0.03 


0.06 


(SE) 


(0.04) 


(0.04) 


(0.01) 


(0.01) 


(0.02) 


(0.02) 


A22 


20.10 


20.22 


6.77 


6.29 


14.65 


18.62 


(SE) 


(0.11) 


(0.10) 


(0.24) 


(0.24) 


(0.24) 


(0.16) 
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Table 4 
Summary of Study 4 



Method 
Criterion 










SSIR 




a = 0.01 


a = 0.005 


GCV 


AIC 


BIC 


RIC 


Sample size 






n = 60 








A 


0.00 


0.00 


6.82 


6.39 


13.61 


17.36 


(SE) 


(0.00) 


(0.00) 


(0.21) 


(0.21) 


(0.18) 


(0.17) 


Sample size 






n= 120 








A 


0.00 


0.00 


0.15 


0.09 


5.91 


13.81 


(SE) 


(0.00) 


(0.00) 


(0.04) 


(0.03) 


(0.24) 


(0.19) 



explanatory variables and filtering out the irrelevant variables. The proposed 
method works competitively to SSIR with RIC. However, in Study 4 
where all of the variables are relevant, the method misses none of them 
while SSIR with RIC misses more than half of them on average. In Study 
4, the AIC and GCV criteria in SSIR outperform BIC and RIC, although 
they still do not perform as well as the proposed method. 

According to Shi and Tsai [18], the RIC criterion has a greater penalty 
function on the effective number of parameters than the AIC and BIC cri- 
teria. Thus, as shown in the simulation studies, SSIR with RIC usually 
compresses more coefficients to be than SSIR with AIC or BIC. There- 
fore, in Studies 1-3 when only a small number of variables are involved in 
the central DRS, the RIC criterion helps filter out the irrelevant variables, 
but in Study 4 when all of the variables are involved, the RIC criterion is 
too aggressive, underscoring the difficulty in doing well for both types of 
situations. In C^, instead of just penalizing the number of parameters, we 
monitor the decrease in the correlation, the objective function in CANCOR, 
which has a very intuitive appeal and makes it easy to outperform other 
criteria. In Study 4, excluding any of the variables will drive the correlation 
out of the lower limit of the confidence interval. As a result, correctly 
picks up all of the variables. 

In summary, the selection criteria in SSIR have varied performance, but 
the proposed method competes favorably with the best criterion used 
in SSIR, and has a smaller risk for being overly aggressive. The simulation 
studies also show that the performance of both SSIR and can be improved 
as the sample size increases. 

5. Examples. In this section, we apply the proposed method to two 
studies, the car price data and the Boston housing data. 

5.1. Example on car prices. In the car price data analyzed in [16], the 
nonnegotiable transaction prices (y) of 25 different family saloons, together 
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with their nine attributes, are recorded. Those nine attributes are mileage 
per gallon (xi), horsepower (X2), length (xs), width (X4), weight (3:5), height 
(xq), satisfaction (X7), reliability (xg) and overall evaluation (xg). Applying 
to the car price data, we aim to select a subset of the attributes to explain 
the variation among the transaction prices. 

The nine attribute variables are standardized to mean and variance 1. 
For estimating the dimensionality K in model (1), the sequential tests are 
performed with different numbers of slices (/) in SIR and different numbers 
of internal knots (fc„) with quadratic spline (order m = 3) in CANCOR. To 
make the estimates K by SIR and CANCOR comparable, the numbers / 
and kn are matched such that I = m -\- kn- The dimensionality estimates by 
SIR and CANCOR with different / and kn, respectively, are summarized in 
Table 5. The estimate by SIR is sensitive to the number of slices while the 
estimate by CANCOR is robust against the number of internal knots. Thus, 
we select the CANCOR estimate K = 1, and estimate the first constrained 
e.d.r. direction using C^. For generating the B-spline basis functions based 
on y, the quadratic spline with A:„ = 4 internal knots is used. 

The first canonical correlation corresponding to the direction estimate 
Pi by CANCOR is 0.950 with the approximated 99.5% lower confidence 
limit 0.858. The iterative process described in Section 2 selects the tun- 
ing parameter t = 1.10. The corresponding constrained direction estimate 
at t = 1.10 yields the correlation 0.872, and has two nonzero coefficients se- 
lected by the variable filtering procedure. Using the two selected variables, 
the constrained direction is re-estimated as described in Section 3. The re- 
estimated constrained direction p( yields the correlation 0.905. The direction 
estimates /3i and p( by CANCOR and C^, respectively, are standardized to 
unit length, and are shown in Table 6. 



Table 5 

Dimensionality estimates for the car price data 







SIR 




CANCOR 




I 
k 


3 
1 


4 5 6 7 8 
2 10 


fcn 

k 1 


1 2 3 
1 1 1 


4 5 
1 1 






Table 6 

Direction estimates by CANCOR and 


for the 


car price data 








X2 X3 X4, ajs 


Xe 


xr Xs 


X9 


Pi 
Pi 


0.056 



0.329 -0.472 0.434 -0.032 
0.588 


0.136 



0.647 -0.142 
0.809 


-0.138 
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Back to the original scale of the variables, the estimated linear combi- 
nation corresponding to p( is 0.046x2 + 0.999x7. The selected variables are 
horsepower (X2) and satisfaction (xy) by the proposed method. In [16], 
the SIR method is applied to the car price data, and the variables are or- 
dered according to the direction estimate by SIR. Based on the ordering, a 
total number of 9 nested models are considered, and an AlC-based model 
selection criterion for single-index models is applied to the 9 nested models, 
resulting in the selected model y = ^(0.045x2 + 0.999x7), where the function 
g{-) is estimated by applying the local polynomial regression with the se- 
lected bandwidth. The direction estimated in [16] is almost identical to the 
one estimated by C^. However, the method does not require an estima- 
tion of g, and it selects (rather than assumes) a uni-dimensional model that 
is adaptive to the data. 

If we apply the variable filtering procedure to the CANCOR direction 
estimate (3i, we end up with 4 variables with nonzero coefficients, X2, X3, X4 
and X7. On the original scale of the variables, the corresponding estimated 
linear combination is 0.041x2 — 0.171x3 -|- 0.471x4 -|- 0.864x7. Given that X3 
and X4 have large coefficients in /3i , they are forced in by the variable filtering 
procedure. However, when the method is performed, the coefficients of 
X3 and X4 decrease as the tuning parameter t decreases, while the coefficient 
of X2 increases until t is very close to 1. When the iterative process stops at 
t = 1.10, the coefficients of X3 and X4 are very close to in the constrained 
direction estimate. As a result, the variable filtering procedure does not 
select X3 and X4. This example confirms that when the explanatory variables 
are correlated, applying the simplified variable filtering procedure given in 
the paper to the unconstrained direction estimates would not be effective, 
and a fuller version of variable selection would be needed. 

5.2. Example with Boston housing data. We now analyze the Boston 
housing data using the proposed method. Although this data set has been 
widely analyzed from different perspectives, our purpose is to demonstrate 
the difference between constrained and unconstrained direction estimates. 
The Boston housing data contains 506 observations, and can be downloaded 
from the web site Ohttp://lib. stat.cmu.edu/datasets/boston_corrected.txt. 
The dependent variable y is the median value of owner-occupied homes in 
each of the 506 census tracts in the Boston Standard Metropolitan Statistical 
Areas. The 13 explanatory variables are per capita crime rate by town (xi); 
proportion of residential land zoned for lots over 25,000 sq.ft (X2); proportion 
of nonretail business acres per town (X3); nitric oxides concentration (X4); 
average number of rooms per dwelling (X5); proportion of owner-occupied 
units built prior to 1940 (xg); weighted distances to five Boston employment 
centers (X7); full- value property-tax rate (xs); pupil-teacher ratio by town 
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Table 7 








Direction estimates b 


y CANCOR and for the 


Boston housing data 








02 


/3f 


Pi 


Xl 




— n nfi4 



u 



u 


X2 


u.uou 


— u.zuu 


n 
u 


n 
u 


X3 


-0.078 


-0.015 








X4 


-0.009 


-0.352 








X5 


0.871 


-0.565 


0.962 


-0.645 


xe 


-0.306 


-0.058 


-0.174 


-0.096 


X7 


-0.291 


-0.149 








Xa 


-0.165 


0.022 


-0.166 





X9 


-0.125 


-0.041 


-0.126 





Xio 


-0.005 


0.089 








Xii 


0.008 


-0.644 





-0.758 


Xl2 


0.043 


0.108 








Xl3 


0.039 


0.143 









(xg); proportion of blacks by town (xio); percentage of lower status of the 
population (xn); Charles River dummy variable (X12); index of accessibility 
to radial highways (xis). 

For observations with crime rate greater than 3.2, the variables X2, X3, xs, 

and X13 are constants except for 3 observations. Thus, as in [1] and [13], 
we use the 374 observations with crime rate smaller than 3.2 in this analysis. 
To make the explanatory variables comparable in scale, we standardize each 
of them individually to mean and variance 1. The linearity condition is 
assumed given the arguments in [13]. As in the simulation studies and the car 
price data example, the quadratic spline (order m = 3) with A:„ = 4 internal 
knots is used to generate the B-spline basis functions '/r(?/). 

By CANCOR, there are four e.d.r. directions that are significant. A closer 
look at those directions showed that the 3rd and 4th directions are mainly 
due to outlying observations. To downweight the effect of outliers, we choose 
the weighted CANCOR approach of [21], where each observation is weighted 
according to its leverage in the x-space. With such weights, only two di- 
rections are found to be significant, and the resulting direction estimates 
by weighted CANCOR and the corresponding method, rescaled to unit 
length, are summarized in Table 7. Each of the estimated e.d.r. directions 
by the weighted CANCOR involves all of the explanatory variables. Due to 
the correlations among the variables, the coefficients cannot be interpreted 
individually. The directions estimated by are more focused, and 9 or 10 
coefficients in each direction are compressed to 0. 

Table 7 shows that, in the first constrained direction estimate the 
variables xs, xg, xg and xg are singled out. However, the variable X5 has 
the dominant loading. In the second constrained direction estimate /^^ , the 
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variables 3:5, xq and xn are singled out. Since the variable x^, the housing 
size information, has already been picked up by p(, the second constrained 
e.d.r. direction /?2 gets at how old and how wealthy the district is. We 
conclude that the housing size information and the district information (age 
and wealth) are well summarized by x'^(3( and x'^/32 to explain the variation 
of the median housing values among those 374 districts. 

The direction estimates Pi by the weig hted CANCOR and /?/ by are 
quite close for the Boston housing data with correlations 0.982 between x^/3i 
and x'^Pi and 0.849 between x'^P2 and x^/?! ■ However, with an acceptable 
sacrifice on the canonical correlations, the irrelevant or marginally important 
variables are filtered out, and the direction estimates by are easier to 
interpret. 

6. Conclusion. The constrained canonical correlation (C^) method is 
proposed to enhance the e.d.r. direction estimates by CANCOR. By im- 
posing the Li-norm constraint in the method, the noise contained in the 
data could be further filtered out and a small number of informative vari- 
ables could be easily identified for each direction estimate. The directions 
estimated by the proposed method are easier to interpret and more help- 
ful for the subsequent statistical analysis. Compared with the SSIR method 
and a recent work [10] on another shrinkage- type method, the proposed 
method has a unique advantage in that we use correlation as a transpar- 
ent objective so that there is a natural criterion to guide us how much to 
"shrink." The proposed method results in final parameter estimates that 
are at least as efficient as the CANCOR estimates, but adapt nicely to spar- 
sity in these directions whenever possible. Asymptotic properties for other 
shrinkage methods are unavailable for comparison at the moment. 

An R interface with a Fortran subroutine is used to implement the pro- 
posed method in this paper. The programs can be obtained from the 
first author upon request. 

APPENDIX 

Proof of Theorem 1. Given a sample {Xt,Yt}f^i, we standardize 

" —1/2 - - ~ 

Xt to Zt = ^ccx' {Xt - X), where {X, 

^xx) are the mean and covariance 
estimates. Using the variables Z = {Zi, . . . , Zn)'^ as in [8], the direction es- 

" —1/2 

timates by CANCOR are Pi = Tjxx fji, where fji are the eigenvectors of the 
matrix A„ = n~^Z^n*(n*^n*)~^n*"^Z and 11* is the centered version of 
U = {TT{Yi),...,TT{Yt)f. Letting M = Cov{E{z\y)) = E[E{z\y)E{z^\y)] and 

— 1/2 

(rji^Xi) are the eigenvectors and eigenvalues of M, we have (3i = J^xx Vi 

1/2 P 
7i = A ■ .It fohows from the same arguments used in [8] that M and 
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To show Pf Pi, it is equivalent to show that any subsequence {/Jf^^^.} of 

{Pf}, f^fe ^ N, contains a further subsequence that converges almost surely 

to Pi- We handle this for each i. 

Let {6ti,Pi) and {a1,Pf) be the direction estimates from CANCOR and 

C^, respectively, and 7^ = ajTiT^xPi and 7f = af^Ti-^xPi- By the choice of 

the tuning parameter t, we have (7^ — 7j) ^ 0, and thus 7? ^ 7i. Given 

p 

the fixed dimensionality of S^xj we also have T,xx '^xx under condition 
^3. Therefore, for any subsequence n/c, we can find a further subsequence 
f^k' ^ nk such that it^ny^li ^S^,!, almost surely. 

Let Oi = {uj: 7f^^^, (w) ^ 7^ and Sis.nfe, (t^) ^ S^.^.}. It is clear that P{Oi) = 
1. To prove Theorem 1, it suffices to show that Oj C {u; : /3f (w) —>/?«}. 

Otherwise, there exists some ojq G Oj such that Piny ('^o) ^ Pi does not 
hold, bmce i-ixx >0, '^xx,ny (wo) ^ Sxx, and M ( (wo)/3f,i,, (wo) = 1, 

the sequence (wo) must be bounded. Thus, there exists a subsequence 
k,ny, ('^o)^/?* / ft with Uk" C nfc. and P*^^xxP* = 1- 

Let A(wo) = {a : a^S^^,„^,„ (wo)a = 1, a'^t^-,r,ny, {^o)alny, =0,1 = 1,..., 
i — 1}, and write Sny, = '^■Kx,nyi{'^o) for simplicity, we have as n^// — > 00, 

(2) max |«^S^,,„^,,(wo)(/3^„^„(a;o) -/?*)! -0. 

Furthermore, let 

"nw, (^^o) = arg niax oFSny,P*, 

^ aGA(a;o) 

"n,„ K) = arg max a^5„,, (wo). 

We then have 

a'lS^0)SnyXKnyX^0)-P*) 

< max {a^Sny, Pi ny, i^o)} - max {q'^S'„ „/?*} 

< ^J, {uo)Sny,Plny, (UJO) " ^J, (wo)S„^,,/3* 



a':,l,{^0)Sny,{Plny,{cO0)-P*) 



which implies 



max ia^SnyJlny, (^^o)} - max {a^Sny,P*} 
< maKySny,{Pln^,XiOo)-P*)\- 
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(wo) = max^gA(a;o){"^'S'„^-"A>fc//(^o)}, it follows from (2) that 
%>j,„(^o) - max„gA(^g){a'^S'„^„/?*} 0, and thus, af^„ (a;o)5'„^„/5* ^ 7i- 

Letting KyXv) = ^{y)n^„<^'ny,{^o), where 7r(y)„^„ denotes the vector of 
the B-s|)line basis functions generated based on the n^// observations, we 
have Cov {x"^ (3* , hn^„ (y)) — > 7i as Uk" — > oo, where Gov is used in the proof 
to denote the sample covariance. We also have Cov(/i,i^„ (y), /irij.// (y)) = 1 by 
the construction of h, so 

[C^v(/i„^„(y),V,(y))]i/2 
We rescale the continuous functions /i„^.„ (y) to get 

K,.„ (y), if max \hn.„ (y)| < 1, 



9ny, (y) 



/infe/, (y) / max \hny, {y)\, if max (y)| > 1, 



to ensure \gny,{y)\ < 1 on [a, 6], and < Cov{gny,{y),gny,{y)) < 1- Replacing 
hny,{y) by S-n^^ly) in (3), we have 

(4) C^w{x^(3\gny, {y)) - l^[6^^{gny, (y), 9n,„ {y)t'^ ^ 0. 

Given condition and the fact \gny,{y)\ ^ li it is easy to see that the 
sample covariances converge to the population counterparts, and therefore 
Coy{x^ P*,gny,{y)) - 7^[Gov(y,^„(y),g„^„(y))]i/2[Cov(x^/?*,x^/?*)]i/2 4 0, 
and 

(5) Gorr(2;'^/3*,g„^„(y)) ^7i. 

We define R^iP) = maxT(y)eH CoY?{x^p, T{y)) for any (3 with x 
'^xxP = 1) where H is the set containing any transformation of y with 
E[T'^{y)] < 00. As in [1], it can be shown that R^{f5) = [3'^ Co\[E{x\y)]l3 . 
Since Co\[E{x\y)] has distinct nonzero eigenvalues and the function R'^{(3) 
is continuous, we have that for any ei > 0, there exists £2 > such that 

(6) pen,{ei)^ Gorr(x^/3, T{y)) < 7^ - £2 for every T{y) G H, 

where f],(ei) = {/? : = 1, = 0, / < i - 1, >ei}. 

For i = 1, it is immediate to see that (5) contradicts to (6). We now 

p 

consider i > 1 by iteratively using the fact that (3f — > /3; for I <i — 1. By 
taking hmits on both sides of PfJ^^,, {uJo)Sny, f^l^y, i^o) = 0, we get f3*T,^^f3i = 
for I < i — 1. Thus /3* falls into r2j(ei) for some positive ei, which again 
leads to a contradiction to (6). □ 

Proof of Theorem 2. (i) If Pl^{d) (for some d<p) does not contain 
all the variables with nonzero coefficient in Pi, the consistency result in 



20 



J. ZHOU AND X. HE 



Theorem 1 and (6) imply that the corresponding will be strictly below 7^ 
for sufficiently large n, and therefore the filtering procedure will favor the 
full set at d = p. Thus the result (i) of Theorem 2 follows. 

(ii) As a result of (i), the union of the selected variables from the filtering 
procedure contains all of the variables involved in the e.d.r. directions {Pi}, 
with probability tending to 1. 

Except for oj € On with P[On) = 5^ — > as n — > cx), the final direction 
estimate from the proposed method is the same as the CANCOR directions 
on the relevant variables (with nonzero coefficients in any of the /Jj's), that 
is, we have = except for w G 0„. Thus - ^ C] - 

P{V^iPi,i ~ ^ C}\ ^ for any C, and — has the same 

limiting distribution as that of -v/n(A,i — On the zero coefficients Pij 
with j > 2, it is either pfj = or = Pij, so the stochastic dominance is 
immediate. □ 
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